In this section, we provide a brief background on STT-RAM, on-chip networks, and 3D integration
technology before discussing our proposal.

%\SubSection{STT-RAM}
\textbf{STT-RAM:} Unlike the traditional SRAM and DRAM technologies that use electric charges as the
information carrier, STT-RAM is based on magnetic characteristics of Magnetic Tunnel Junctions~(MTJs)
and uses MTJs for binary storage. As shown in Figure~\ref{fig:mram}, an MTJ contains two ferromagnetic
layers and one tunnel barrier layer (MgO). The magnetization direction of one ferromagnetic layer (called the
reference layer) is fixed, while the direction of the second layer (called the free layer) can be changed
by applying a driving current. The relative magnetization direction of the two ferromagnetic layers
determines the resistance of the MTJ. If the two layers have the same direction (parallel), the
resistance of the MTJ is low, indicating a ``0'' state; if their directions are opposite
(anti-parallel), the resistance is high, indicating a ``1'' state.
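
The resistance gap between the two states is commonly quantified by the tunneling magnetoresistance
(TMR) ratio. Writing $R_{P}$ and $R_{AP}$ for the parallel and anti-parallel resistances (our
notation, introduced only for this illustration), the usual definition is
\[
\mathrm{TMR} = \frac{R_{AP} - R_{P}}{R_{P}},
\]
so a larger TMR corresponds to a larger resistance difference between the ``0'' and ``1'' states.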

%\begin{figure*} [t]
%\begin{minipage}{0.50\textwidth}
\begin{wrapfigure}{l}{3.50in}
\centering
\begin{tabular}{c} %c
 \psfig{figure=figures/mtj_mram.eps, width=1.3in, height=3.50in, angle=-90} \\ %&
 %\psfig{figure=figures/mram_cell.eps, width=1.9in, height=1.2in} \\
\end{tabular}
\hrule
 \caption{\scriptsize \bf MTJ structure and STT-RAM cell (a) Anti-parallel (high resistance),
 indicating ``1'' state (b) Parallel (low resistance), indicating ``0'' state (c) STT-RAM Structural view (d) STT-RAM Schematic.} \label{fig:mram}
%\end{minipage}
%\hfill
%\begin{minipage}{0.50\textwidth}
%\begin{tabular}{c}
%\centering %\hspace{1.5in}
% \psfig{figure=figures/mram_cell.eps, width=3in, height=1.2in}
%\end{tabular}
%\hrule
% \caption{\scriptsize \bf Demonstration of a STT-RAM cell. (a) Structural view. (b) Schematic view.}
% \label{mram_cell}
%\end{minipage}
%\end{figure*}
\end{wrapfigure}

The MTJ is the storage element of STT-RAM, and a memory cell can be designed using a
\emph{one-transistor-one-MTJ} (``1T1J'') structure~\cite{MRAM:HYY+05,MRAM:KTM+07}. As illustrated in
Figure~\ref{fig:mram}, each MTJ is connected in series with an NMOS transistor. The gate of the NMOS is
connected to the word line (WL), and the NMOS is turned on when its connected MTJ needs to be accessed
during a read or write operation. The source of the NMOS is connected to the source line (SL), and the
free ferromagnetic layer is connected to the bit line (BL). The STT-RAM read mechanism uses
sense amplifiers to sense the voltage difference caused by the resistance difference of the MTJ in the ``0''
and ``1'' states. The read latency of STT-RAM can be as short as the SRAM read latency, and the read
energy of STT-RAM is also comparable to the SRAM read energy. However, the write duration and
the write energy of STT-RAM are significantly higher than those of an SRAM write access. In this paper,
based on a few recently published works on circuit analysis of
STT-RAMs~\cite{MRAM:HYY+05,MRAM:KTM+07}, we use 3 cycles for the read latency and 33 cycles for the write
latency at 3~GHz. This significant asymmetry between read and write access times forms the motivation for
our work.
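
For reference, these cycle counts convert to absolute time as follows. This is a simple conversion
that assumes the 3~GHz clock stated above; the symbols $t_{\mathrm{read}}$ and $t_{\mathrm{write}}$
are our notation for this illustration.
\[
t_{\mathrm{read}} = \frac{3\ \mathrm{cycles}}{3\ \mathrm{GHz}} = 1\ \mathrm{ns},
\qquad
t_{\mathrm{write}} = \frac{33\ \mathrm{cycles}}{3\ \mathrm{GHz}} = 11\ \mathrm{ns}.
\]
Under these parameters, a write takes $33/3 = 11$ times as long as a read.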

%\SubSection{Network-on-Chip (NoC) architectures}

\textbf{Network-on-Chip (NoC) architectures:} A packet-based NoC provides a scalable interconnection
fabric for connecting the processor nodes, the on-chip shared cache banks and the on-chip memory
controllers~\cite{Dally-DAC}. On-chip routers and links constitute this scalable communication
backbone. A generic NoC router has $P$ input and $P$ output channels/ports; typically $P = 5$ for a 2D
mesh, one for each cardinal direction and one for the local node. The main components of a router are
the routing computation unit (RC), the virtual channel arbitration unit (VA), the switch arbitration unit
(SA), and a crossbar that connects the input and output ports. For each packet, the RC unit determines
the next router based on the packet address, as well as the virtual channel (VC) to be used in that
router. The VA unit arbitrates among all packets requesting access to the same VC and selects the
winners. The SA unit arbitrates among all VCs requesting access to the crossbar and grants
permission to the winning packets/flits. The winners then traverse the crossbar and are
placed on the output links. State-of-the-art wormhole-switched NoCs devote two to four pipeline
stages to these components~\cite{Peh-dally} and typically employ dimension-ordered routing (e.g., X-Y
routing) to route packets in the network.
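
As an illustration of dimension-ordered routing, X-Y routing can be written as a function of the
current router coordinates $(x_c, y_c)$ and the destination coordinates $(x_d, y_d)$. The notation
and the port names are ours, and we assume that East and North denote increasing $x$ and $y$,
respectively; a given implementation may use a different convention.
\[
\mathrm{port}(x_c, y_c, x_d, y_d) =
\left\{
\begin{array}{ll}
\mathrm{East}  & \mbox{if } x_d > x_c, \\
\mathrm{West}  & \mbox{if } x_d < x_c, \\
\mathrm{North} & \mbox{if } x_d = x_c \mbox{ and } y_d > y_c, \\
\mathrm{South} & \mbox{if } x_d = x_c \mbox{ and } y_d < y_c, \\
\mathrm{Local} & \mbox{if } x_d = x_c \mbox{ and } y_d = y_c.
\end{array}
\right.
\]
A packet therefore first exhausts its offset in the X dimension and only then moves in the Y
dimension.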

The two arbitration stages (VA and SA), where a router must choose one packet/flit among several
packets/flits competing for a common (a) output VC or (b) crossbar output port, play a major
role in selecting packets for transmission. Current router implementations use simple, local
arbitration policies such as round robin (RR) to decide which packet should be scheduled next. We
propose to modify these local, architecture-oblivious arbiters to prioritize packets so as to hide
the long STT-RAM write latency.
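
For concreteness, a round-robin arbiter over $N$ requesters can be described as follows. This is one
common formulation, not taken from a specific router implementation, and the symbols $N$, $p$, $R$,
and $w$ are our notation. Let $p \in \{0, \ldots, N-1\}$ be the index of the highest-priority
requester in the current cycle and let $R \subseteq \{0, \ldots, N-1\}$ be the non-empty set of
active requesters. The winner $w$ and the updated pointer $p'$ are
\[
w = \arg\min_{i \in R} \bigl( (i - p) \bmod N \bigr),
\qquad
p' = (w + 1) \bmod N,
\]
so the most recently served requester receives the lowest priority in the next arbitration. Note
that this policy depends only on the requester indices, not on any property of the competing
packets.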

%\SubSection{3D integration technology}
\textbf{3D integration technology:} 3D stacking is a technology that stacks multiple active silicon
dies on top of each other and connects them through wafer bonding. The dies communicate
through Through-Silicon Vias (TSVs)~\cite{3d-hpca-2010}. Among various 3D architectures, stacking
cache or memory chips directly on top of a 2D multicore chip~\cite{picoserver, 3D-micro, madan-hpca}
has gained popularity because such an architecture provides fast, high-bandwidth memory access
between the core layer and the memory layer. We exploit 3D stacking to place the STT-RAM based
cache banks in one layer of the CMP configuration. Consequently, all routers in the 3D NoC become
six-ported, with one additional downward link for core layer routers and one upward link for cache layer
routers.
